Abstract

This note makes an observation that significantly simplifies previous parallel, two-way merge 
algorithms based on binary search and sequential merge in parallel. First, it is shown that the 
additional merge step of distinguished elements as found in previous algorithms is not necessary,
thus simplifying the implementation and reducing constant factors. Second, by fixing
the requirements on the binary search, the merge algorithm becomes stable, provided that the
sequential merge subroutine is stable. The stable, parallel merge algorithm can easily be used 
to implement a stable, parallel merge sort. 

For ordered sequences with $n$ and $m$ elements, $m \le n$, the simplified merge algorithm
runs in $O(n/p + \log n)$ operations using $p$ processing elements. It can be implemented on an
EREW PRAM, but since it requires only a single synchronization step, it is also a candidate
for implementation on other parallel, shared-memory computers.

Keywords: Parallel merging, Parallel algorithms, Implementation, Shared-memory computers,
PRAM.

1 Introduction 

All parallel, two-way merge algorithms (see, e.g., [31 IH El El El [TUl [TI] ) for the case where the number 
of elements in the sequences to be merged is larger than the number of processing elements seem 
to build on the scheme found in, e.g., [10]: Binary search is used to divide the two input sequences
into disjoint sequences that can be merged pairwise independently. Binary searches are performed 
in parallel for a small selection of distinguished elements from the two input sequences. A separate, 
parallel merge of distinguished and located elements is needed to determine pairs of subsequences 
to merge. Often, such algorithms are not naturally stable, and take extra space to be made stable. 
This note shows that the parallel merge of distinguished and located elements is not needed, and 
at the same time makes the merge algorithm stable without any additional space or time overhead. 
This significantly simplifies implementation, whether on a PRAM [H [8] or on real hardware with, 
e.g., OpenMP [1]. On an EREW PRAM the simplified algorithm runs in $O(n/p + \log n)$ time steps
where n is the size of the longest input sequence. 
Figure 1: Two non-decreasing sequences $A$ and $B$ with $n = 18$ and $m = 15$ elements, respectively,
divided into $p = 5$ consecutive blocks. For the starting elements $x_i$ and $y_j$ of the blocks the
corresponding low and high cross ranks are shown, denoted as $\bar{x}_i$ and $\bar{y}_j$, respectively. The cross
ranks from the $A$ array illustrate four of the five cases for the merge step: $\bar{x}_0$ (a), $\bar{x}_1$ and $\bar{x}_2$ (e),
$\bar{x}_3$ (b), and $\bar{x}_4$ (c). The cross ranks $\bar{y}_0$ and $\bar{y}_3$ from $B$ illustrate case (d). The algorithm identifies
the following $2p = 10$ merge subproblems of disjoint sequences that can be handled in parallel:
$A[0, \ldots, 3]$ is copied into $C[0, \ldots, 3]$, $A[4]$ is copied into $C[4]$, $A[8]$ is copied into $C[14]$, $A[12, \ldots, 14]$
and $B[7]$ are merged into $C[19, \ldots, 22]$, $A[15]$ and $B[8]$ are merged into $C[23, 24]$ (all by Step 3);
and $B[0, \ldots, 2]$ and $A[5, \ldots, 7]$ are merged into $C[5, \ldots, 10]$, $B[3, \ldots, 5]$ is copied into $C[11, \ldots, 13]$,
$B[6]$ and $A[9, \ldots, 11]$ are merged into $C[15, \ldots, 18]$, $B[9, \ldots, 11]$ and $A[16, 17]$ are merged into
$C[25, \ldots, 29]$, $B[12, \ldots, 14]$ is copied into $C[30, \ldots, 32]$ (all by Step 4).

2 The merge algorithm 

Let $A$ and $B$ be two non-decreasing (possibly non-disjoint and allowed to contain repeated elements)
sequences ordered by a relation $\le$ with $n$ and $m$ elements, respectively. Assume without loss of
generality that $m \le n$. The input sequences are stored in arrays indexed from 0, and a merged
output sequence is to be delivered in an array $C$ with $n + m$ elements. For convenience, assume
that $A[-1] = -\infty$, $A[n] = \infty$, and similarly for array $B$. An implementation does not have to store
or reserve space for these sentinel elements, though.

For some element $x$ and array $X$ let its low rank, denoted $\mathrm{rank\_low}(x, X)$, be the unique index
$i$, $0 \le i \le n$, such that

$X[i-1] < x \le X[i]$
and its high rank, denoted $\mathrm{rank\_high}(x, X)$, be the unique index $j$ such that

$X[j-1] \le x < X[j]$

The low (resp. high) rank of an element a = A[i] (resp. b = B[i]) in the array B (resp. A) 
will be referred to as the cross rank of a (resp. b) in B (resp. A). The correctness of the merge 
algorithm is based on the observation that cross ranks "do not cross". By this, cross ranks from 
$A$ can be used to partition $B$ and vice versa. Consider the cross ranks in $B$ of two elements $A[x_i]$ and
$A[x_{i+1}]$ with $x_i < x_{i+1}$. The observation below states that any element of $B$ at a position from the
cross rank of $A[x_i]$ up to the cross rank of $A[x_{i+1}]$ minus one has a cross rank between $x_i$ and $x_{i+1}$. Cross ranks for
selected elements of the two sequences are shown in Figure 1.
Observation 1 Let $a = A[i]$ be an element of $A$ and let $j = \mathrm{rank\_low}(a, B)$ be its cross rank in $B$. The
cross rank for any element $j' < j$ that comes before $j$ in $B$ is not after $i$, i.e., $\mathrm{rank\_high}(B[j'], A) \le i$,
and the cross rank for any element $j'' \ge j$ at or after $j$ in $B$ is strictly after $i$, i.e.,
$\mathrm{rank\_high}(B[j''], A) > i$. In particular, for $i' = \mathrm{rank\_high}(B[j], A)$ it is the case that $i' > i$. The
same holds mutatis mutandis for elements in $B$.

To see this, consider the cross rank $j = \mathrm{rank\_low}(a, B)$ of an element $a = A[i]$, and let $j' < j$.
Since $j$ is the low rank of $a$ in $B$, $B[j'] \le B[j-1] < a \le B[j]$, and therefore $\mathrm{rank\_high}(B[j'], A) \le i$.
For $j \le j''$, $a \le B[j] \le B[j'']$ and therefore $\mathrm{rank\_high}(B[j''], A) > i$.
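The observation can be checked by brute force on small inputs. The following is a minimal sketch, assuming Python's `bisect_left` and `bisect_right` as stand-ins for the low and high ranks; the function name `check_observation` is hypothetical and not part of the note.

```python
from bisect import bisect_left, bisect_right

def check_observation(A, B):
    """Return True if cross ranks of sorted lists A and B "do not cross"."""
    for i, a in enumerate(A):
        j = bisect_left(B, a)  # low rank of a = A[i] in B
        # Elements of B before position j have a high rank in A of at most i.
        if any(bisect_right(A, B[k]) > i for k in range(j)):
            return False
        # Elements of B at or after position j have a high rank strictly above i.
        if any(bisect_right(A, B[k]) <= i for k in range(j, len(B))):
            return False
    return True
```

The check is quadratic and only meant for experimentation; it plays no role in the merge algorithm itself.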

Both low and high ranks can be computed by suitably modified binary search in $O(\log n)$ steps.
The low rank of $a = A[i]$ from $A$ is the number of elements from $B$ that must come before $a$ in
a stable merge of $A$ and $B$ in which all (repeated) elements $a$ from $A$ are ordered before elements
$a$ from $B$, that is, the position of $a$ in the stably merged output sequence is $i + \mathrm{rank\_low}(A[i], B)$.
Conversely, the high rank of $b = B[j]$ in $A$ is the number of elements from $A$ that must come before
$b$ in a stable merge, that is, the position of $b$ in the stably merged output is $j + \mathrm{rank\_high}(B[j], A)$.
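As an illustration, the two ranks map directly onto Python's standard binary searches. This is a sketch under the assumption that the inputs are sorted Python lists; the helper `stable_positions` is a hypothetical name introduced here to show the output positions described above.

```python
from bisect import bisect_left, bisect_right

def rank_low(x, X):
    """Index i with X[i-1] < x <= X[i] (sentinels implied at both ends)."""
    return bisect_left(X, x)

def rank_high(x, X):
    """Index j with X[j-1] <= x < X[j] (sentinels implied at both ends)."""
    return bisect_right(X, x)

def stable_positions(A, B):
    """Output positions of all elements in a stable merge with A before B on ties."""
    pos_a = [i + rank_low(a, B) for i, a in enumerate(A)]
    pos_b = [j + rank_high(b, A) for j, b in enumerate(B)]
    return pos_a, pos_b
```

For `A = [1, 3, 3, 5]` and `B = [3, 4]`, `stable_positions` returns `([0, 1, 2, 5], [3, 4])`: both 3s from `A` precede the 3 from `B`, and the positions together cover 0 to 5 exactly once.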

Now, let $p$ be the number of processing elements. The input sequences $A$ and $B$ are divided
into roughly equal sized, consecutive, contiguous blocks differing in size by at most one. The first
$r = n \bmod p$ blocks of $A$ will get $\lceil n/p \rceil$ elements, and the remaining blocks $\lfloor n/p \rfloor$ elements; similarly
for the $B$ array. The start index $x_i$ of block $i$ in $A$ for each $0 \le i < p$ is determined by

$$x_i = \begin{cases} i \lceil n/p \rceil & \text{for } i < r \\ i \lfloor n/p \rfloor + n \bmod p & \text{otherwise} \end{cases}$$

In addition, define $x_p = n$. The start indices $y_j$ for the $p$ blocks of $B$ are defined similarly. Let
$k$ be some index in $A$ (resp. $B$). Index $k$ belongs to block $i$ if either $k < r \lceil n/p \rceil$ and $\lfloor k / \lceil n/p \rceil \rfloor = i$,
or $k \ge r \lceil n/p \rceil$ and $\lfloor (k - r \lceil n/p \rceil) / \lfloor n/p \rfloor \rfloor + r = i$. Computing a block start index $x_i$ or $y_j$ as well
as determining the block to which a given index $k$ belongs are thus all constant time operations.
With this, stable, parallel merge is accomplished by the following steps:

1. Compute $\bar{x}_i = \mathrm{rank\_low}(A[x_i], B)$ for $0 \le i < p$ by binary search in parallel. Also, let $\bar{x}_p = m$.

2. Compute $\bar{y}_j = \mathrm{rank\_high}(B[y_j], A)$ for $0 \le j < p$ by binary search in parallel. Also, let $\bar{y}_p = n$.

3. Merge disjoint sequences from $A$ and $B$ in parallel by assigning a processing element to each
$x_i$ for $0 \le i < p$ as follows:

(a) If $\bar{x}_i = \bar{x}_{i+1}$, output $A[x_i, \ldots, x_{i+1} - 1]$ to $C[x_i + \bar{x}_i, \ldots]$.

(b) If $\bar{x}_i \ne \bar{x}_{i+1}$ are in the same block $j$ of $B$, and $\bar{x}_i \ne y_j$, merge $A[x_i, \ldots, x_{i+1} - 1]$ stably
with $B[\bar{x}_i, \ldots, \bar{x}_{i+1} - 1]$ and output to $C[x_i + \bar{x}_i, \ldots]$.

(c) If $\bar{x}_i$ and $\bar{x}_{i+1}$ are in different $B$ blocks, with $\bar{x}_i$ in block $j$, $\bar{x}_i \ne y_j$ and $\bar{x}_{i+1} \ne y_{j+1}$,
merge $A[x_i, \ldots, \bar{y}_{j+1} - 1]$ stably with $B[\bar{x}_i, \ldots, y_{j+1} - 1]$. Output to $C[x_i + \bar{x}_i, \ldots]$.

(d) If $\bar{x}_i$ and $\bar{x}_{i+1}$ are in different $B$ blocks, with $\bar{x}_i$ in block $j$, $\bar{x}_i \ne y_j$ and $\bar{x}_{i+1} = y_{j+1}$,
merge $A[x_i, \ldots, x_{i+1} - 1]$ stably with $B[\bar{x}_i, \ldots, y_{j+1} - 1]$. Output to $C[x_i + \bar{x}_i, \ldots]$.

(e) If $\bar{x}_i$ is in block $j$ of $B$, $\bar{x}_i = y_j$, and $\bar{x}_i \ne \bar{x}_{i+1}$, output $A[x_i, \ldots, \bar{y}_j - 1]$ to $C[x_i + \bar{x}_i, \ldots]$.

4. Merge disjoint sequences from $B$ and $A$ in the same fashion by assigning a processing element
to each $y_j$ for $0 \le j < p$.
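The four steps can be sketched in Python as follows. This is a sequential simulation under stated assumptions, not the author's implementation: the loop over blocks stands in for the $p$ processing elements, `bisect_left` and `bisect_right` stand in for the low and high rank searches, and all function names (`block_start`, `block_of`, `seq_merge`, `merge_blocks`, `parallel_merge`) are hypothetical. The sketch additionally assumes $p \le m \le n$ so that no block is empty.

```python
from bisect import bisect_left, bisect_right

def block_start(i, n, p):
    """Start index of block i when n elements are split into p blocks (block p starts at n)."""
    r = n % p
    return i * (n // p + 1) if i < r else i * (n // p) + r

def block_of(k, n, p):
    """Block containing index k, 0 <= k <= n; index n maps to the virtual block p."""
    r, small = n % p, n // p
    big = small + 1
    return k // big if k < r * big else (k - r * big) // small + r

def seq_merge(A, a_lo, a_hi, B, b_lo, b_hi, C, c):
    """Stable sequential merge of A[a_lo:a_hi] and B[b_lo:b_hi] into C starting at c."""
    while a_lo < a_hi and b_lo < b_hi:
        if B[b_lo] < A[a_lo]:      # take from B only if strictly smaller
            C[c] = B[b_lo]
            b_lo += 1
        else:                      # ties go to A
            C[c] = A[a_lo]
            a_lo += 1
        c += 1
    rest = A[a_lo:a_hi] + B[b_lo:b_hi]   # at most one of the two slices is non-empty
    C[c:c + len(rest)] = rest

def merge_blocks(s, s_bar, t, t_bar, t_len, p, emit):
    """Steps 3 and 4: one independent task per block start s[i] with cross rank s_bar[i]."""
    for i in range(p):             # iterations are independent and could run in parallel
        lo, hi, r0, r1 = s[i], s[i + 1], s_bar[i], s_bar[i + 1]
        if r0 == r1:                               # case (a)
            emit(lo, hi, r0, r0)
            continue
        j = block_of(r0, t_len, p)
        if r0 == t[j]:                             # case (e)
            emit(lo, t_bar[j], r0, r0)
        elif block_of(r1, t_len, p) == j:          # case (b)
            emit(lo, hi, r0, r1)
        elif r1 == t[j + 1]:                       # case (d)
            emit(lo, hi, r0, t[j + 1])
        else:                                      # case (c)
            emit(lo, t_bar[j + 1], r0, t[j + 1])

def parallel_merge(A, B, p):
    """Stably merge sorted lists A and B using p simulated processing elements."""
    n, m = len(A), len(B)
    assert 1 <= p <= min(n, m), "sketch assumes no empty blocks"
    x = [block_start(i, n, p) for i in range(p + 1)]
    y = [block_start(j, m, p) for j in range(p + 1)]
    x_bar = [bisect_left(B, A[x[i]]) for i in range(p)] + [m]    # Step 1
    y_bar = [bisect_right(A, B[y[j]]) for j in range(p)] + [n]   # Step 2
    C = [None] * (n + m)
    # Step 3: blocks of A; emit receives (a_lo, a_hi, b_lo, b_hi).
    merge_blocks(x, x_bar, y, y_bar, m, p,
                 lambda a0, a1, b0, b1: seq_merge(A, a0, a1, B, b0, b1, C, a0 + b0))
    # Step 4: blocks of B; emit receives (b_lo, b_hi, a_lo, a_hi).
    merge_blocks(y, y_bar, x, x_bar, n, p,
                 lambda b0, b1, a0, a1: seq_merge(A, a0, a1, B, b0, b1, C, a0 + b0))
    return C
```

Each call to `emit` writes to a range of `C` that no other call touches, which is why the only synchronization point in a real parallel version would sit between the rank computations and the two merge loops. For any valid `p`, `parallel_merge(A, B, p)` should equal `sorted(A + B)`, including the relative order of equal elements.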

The results of the binary searches are shown in Figure 1, which illustrates in particular the
five cases for the merge steps. By the observation on cross ranks, all blocks and sequences are
disjoint, and obviously partition the arrays. In particular, for case (c), the condition implies that
$\bar{x}_{i+1} > y_{j+1}$ and therefore it holds that $\bar{y}_{j+1} \le x_{i+1}$, such that the segment $A[x_i, \ldots, \bar{y}_{j+1} - 1]$ falls entirely
within block $i$ of $A$. The exception where $\bar{x}_{i+1} = y_{j+1}$ is covered by case (d); by the observation it
holds in that case that $\bar{y}_{j+1} > x_{i+1}$. Correctness is therefore established. Stability in the sense that all
elements $a$ of $A$ will be placed before elements $a$ of $B$ is also maintained by the use of low and high
ranks, provided a stable sequential merge is used. Since all blocks contain $O(n/p)$ elements and the
sequences determined by the cross ranks as in the cases (a) to (e) fall entirely within blocks, the
number of steps for the parallel merge operations with $p$ processing elements is likewise $O(n/p)$.
The binary searches are done in parallel and take $O(\log n)$ operations. Determining which merge
case applies entails determining the blocks of $\bar{x}_i$ and $\bar{x}_{i+1}$ and takes $O(1)$ operations. In summary:

Theorem 1 Two ordered sequences $A$ and $B$ of length $n$ and $m$, respectively, with $m \le n$ can be
merged stably in parallel in $O(n/p + \log n)$ operations using $p$ processing elements. Only constant
extra space in addition to the input and output arrays is needed. 

Synchronization is only required once, after the two binary search steps, whose results, the cross ranks
$\bar{x}_i$ and $\bar{y}_j$, are conveniently stored in $(p+1)$-element arrays. The algorithm can trivially be implemented
on a CREW PRAM. For implementation on an EREW PRAM, it is first observed that each merge
of a block from $A$ requires comparing only two locations, namely $\bar{x}_i$ and $\bar{x}_{i+1}$, so accessing indices $\bar{x}_i$
and $\bar{x}_{i+1}$ by the $p$ processing elements in different steps suffices to eliminate concurrent reads. Start
addresses of the arrays $A$, $B$, and $C$ can be copied to the $p$ processing elements in $O(\log p)$ steps
by parallel prefix operations. The parallel binary searches can be pipelined to eliminate concurrent
reads, and still the $p$ searches can be done in $O(\log n)$ parallel time steps. This is also a standard
technique. A more processor efficient algorithm for EREW PRAM parallel binary search can be 
found in [2]. 

Overall, the algorithm is a considerable simplification of [5, 10] and other similar merging
algorithms in that no merging of the sequences of distinguished elements is needed. 

3 Applications and remarks 

The stable parallel merge algorithm can be used to implement a stable, parallel merge sort that 
runs in $O(n \log n/p + \log p \log n)$ parallel time steps for $n$ element arrays. As in [10] this is done
by first sorting sequentially in parallel $p$ consecutive blocks of $O(n/p)$ elements, and then merging
the sorted blocks in parallel in $\lceil \log p \rceil$ merge rounds. In round $i$, $i = 1, \ldots, \lceil \log p \rceil$, there are at
most $\lceil p/2^i \rceil$ pairs to be merged and possibly one sequence that just has to be copied. This can be
accomplished either by grouping the processing elements into groups of $2^i$ consecutively numbered
elements, or by modifying the merge algorithm to work in parallel on the $\lceil p/2^i \rceil$ pairs. The latter
can easily be accomplished. Thus, a stable merge sort can be implemented with no extra space 
apart from input and output arrays. 
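The round structure of the merge sort can be sketched as below. This is an illustration under simplifying assumptions: a plain sequential stable merge stands in for the parallel merge, the runs are kept as separate Python lists rather than in the input and output arrays, and the names `stable_merge` and `merge_sort_rounds` are hypothetical.

```python
def stable_merge(left, right):
    """Sequential stable two-way merge; on ties, elements of left come first."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    return out + left[i:] + right[j:]

def merge_sort_rounds(data, p):
    """Sort p consecutive blocks, then merge neighbouring runs round by round.

    Returns the sorted list and the number of merge rounds performed.
    """
    n = len(data)
    # Block boundaries: the first n mod p blocks get one extra element.
    bounds = [i * (n // p) + min(i, n % p) for i in range(p + 1)]
    # Each block would be sorted by its own processing element; sorted() is stable.
    runs = [sorted(data[bounds[i]:bounds[i + 1]]) for i in range(p)]
    rounds = 0
    while len(runs) > 1:
        # All pairs of one round are independent and could be merged in parallel.
        merged = [stable_merge(runs[k], runs[k + 1]) for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])   # an unpaired run is carried over unchanged
        runs = merged
        rounds += 1
    return runs[0], rounds
```

Because only neighbouring runs are merged and ties always favour the left run, the result keeps equal elements in their input order, and the number of rounds equals $\lceil \log_2 p \rceil$.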

The simplified merge algorithm is likewise useful for distributed implementation, e.g. on a 
BSP as in [4]; here the eliminated merge of p pairs of distinguished elements can save at least one 
expensive round of communication. Details are outside the scope of this note. 